INTRODUCTION

Multiple regression models, and the logistic regression 
model in particular, have a long history in the analysis of 
epidemiological studies of multifactorial disease. Since its 
introduction by Truett et al. (1967), this has become the pre- 
eminent statistical tool of the epidemiologist and only in 
the 1980s were other models explored in a systematic way, 
to examine the possibility of discriminating between dif- 
ferent mechanistic models. A framework for such a formal 
study was provided by the idea of generalized linear models 
(GLMs) (McCullagh and Nelder, 1989; Nelder and Wedder- 
burn, 1972) — regression models in which the scale on which 
effects of covariates combine in an additive manner can be 
varied by choice of a "link function." 

Just as epidemiologists in the late 1960s had to address 
the problem of multifactorial influences on disease, geneti- 
cists faced the same problems in addressing the genetics of 
"complex" traits in which disease is influenced by a mul- 
tiplicity of genetic and environmental causes. In particular, 
Risch (1990) considered two different models for the man- 
ner in which multiple genes act together to influence risk of 
a complex disease trait and explored their implications in 
the context of linkage analysis. However, although the prob- 
lems faced by geneticists and epidemiologists in studying 
multifactorial disease have much in common, there has been 
remarkably little reference to the earlier epidemiological lit- 
erature in the more recent genetic literature (Clayton, 2009). 

In this paper, these issues are reviewed in the context of 
modern genetic association studies, starting with the issue
of whether it is necessary to allow for known risk factors
when testing for new genetic associations. It is shown that
the optimal test strategy depends on which model holds and
that the concept of the "link function" of GLMs captures this 
dependence elegantly. Later sections deal with implications 
of the link function for interpretation in terms of mecha- 
nism, and for the potential to predict disease outcome from 
genotype. 

TESTING FOR ASSOCIATION IN 
THE PRESENCE OF KNOWN RISK 
FACTORS 

This section deals with the problem of testing for asso- 
ciation in the presence of known, strongly predictive, risk 
factors. For simplicity, it is assumed that existing factors can 
be represented by a discrete stratification, although a full 
regression generalization follows naturally. We start by re- 
visiting the theory of the conventional stratified test within 
the general framework of the logistic regression model. 

THE LOGISTIC REGRESSION MODEL 
AND CASE-CONTROL STUDIES 

We shall first introduce some notation. Let Y = 1 denote
presence of disease and Y = 0 its absence, let X represent 
a single variable of interest and let Z represent covariates. 
Here, Z will simply represent a discrete variable describing 
risk strata, but the results presented may readily be generalized
to the full regression case. If S represents the fact of
being sampled in a case-control study and if, given Y, S is
independent of X and Z, Bayes' theorem gives

$$\frac{\Pr(Y=1 \mid X,Z,S)}{\Pr(Y=0 \mid X,Z,S)} = \frac{\Pr(S \mid Y=1)}{\Pr(S \mid Y=0)} \cdot \frac{\Pr(Y=1 \mid X,Z)}{\Pr(Y=0 \mid X,Z)}.$$

The first term on the right-hand side, which will be denoted 
by K, is the ratio of sampling rates of cases and non-cases 
or, equivalently 



$$K = \frac{\Pr(S \mid Y=1)}{\Pr(S \mid Y=0)} = \frac{\text{ratio of cases to controls in the study}}{\text{marginal odds of disease in the population}}.$$



In general, K will be unknown, but it can usually be esti- 
mated approximately and, for the purpose required here, 
only an approximate estimate will be necessary. Taking log- 
arithms, 

$$\log\frac{\Pr(Y=1 \mid X,Z,S)}{\Pr(Y=0 \mid X,Z,S)} = \log K + \log\frac{\Pr(Y=1 \mid X,Z)}{\Pr(Y=0 \mid X,Z)},$$



giving the well known result that, if a logistic regression 
model holds for disease in the population: 



$$\log\frac{\Pr(Y=1 \mid X=x, Z=z)}{\Pr(Y=0 \mid X=x, Z=z)} = \alpha + \beta x + \gamma_z,$$



then a logistic regression model will also hold in the case- 
control study, with all coefficients unchanged save for the 
intercept, which becomes $\alpha + \log K$. It is this result which
has led to logistic regression having become the preeminent 
statistical tool of modern epidemiology. 
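
The intercept shift can be checked numerically. The following is a minimal sketch only: the coefficient values, the stratum effects and the value of K are hypothetical choices made for illustration, and the function names are not taken from any library.

```python
import math

def expit(t):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    return math.log(p / (1.0 - p))

def case_control_prob(pi, K):
    """Pr(Y=1 | X, Z, S) for population risk pi and sampling ratio K.

    Uses the Bayes relation: the odds in the study are K times the
    odds in the population.
    """
    odds = K * pi / (1.0 - pi)
    return odds / (1.0 + odds)

# Hypothetical population model: logit(pi) = alpha + beta * x + gamma_z
alpha, beta = -5.0, 0.3
gamma = {0: 0.0, 1: 1.2}
K = 150.0  # illustrative ratio of sampling rates of cases and non-cases

# Difference between study and population logits for every (x, z) cell
shifts = []
for x in (0, 1, 2):
    for z in (0, 1):
        pi = expit(alpha + beta * x + gamma[z])
        mu = case_control_prob(pi, K)
        shifts.append(logit(mu) - logit(pi))

# Every entry of `shifts` equals log K, so only the intercept changes.
print(shifts, math.log(K))
```

Because the shift is the same in every cell, the coefficients of x and z are untouched by the sampling scheme.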

The logistic regression model is a GLM with binomial 
(Bernoulli) error structure and logit link function and, since 
this link function is the "canonical" link in this setting, the 
first derivative of the log-likelihood with respect to the 
parameter $\beta$ takes a very simple form. For observations
$\{y_i, x_i, z_i;\ i = 1, \ldots, N\}$, this is

$$\frac{\partial \log L}{\partial \beta} = \sum_{i=1}^{N} x_i\,(y_i - \mu_i),$$

where $\mu_i$ is the expectation of $y_i$ given by the model, given
the values of $x_i$ and $z_i$. This expression gives the score test
statistic for association between disease and exposure in 
the presence of stratification, proposed by Mantel (1963) 
as an extension to the Cochran-Armitage test for trend in
proportions (Armitage, 1955; Cochran, 1954); one simply
evaluates the above expression at the maximum-likelihood
estimate of the remaining model parameters, having set
$\beta = 0$. This simply replaces $\mu_i$ by the stratum means of
y. More usually, the test is written in terms of the three-
dimensional contingency table of frequencies, $f_{yxz}$,

$$U = \sum_z \sum_x x\,(f_{1xz} - e_{1xz}),$$

where a dot subscript denotes summation, and $e_{yxz}$ denotes
the "expected" frequencies under the null hypothesis of
conditional independence of Y and X given Z:

$$e_{yxz} = \frac{f_{y \cdot z}\, f_{\cdot xz}}{f_{\cdot\cdot z}}.$$

If we denote by $y_z$ and $x_z$ the vectors of observations in
stratum z (where the correspondence of y and x observations
of the same subject is maintained), the test may also
be written as

$$U = \sum_z \left( x_z^{T} y_z - N_z\, \bar{x}_z\, \bar{y}_z \right),$$

where $\bar{x}_z$ and $\bar{y}_z$ are the stratum-specific means of x and y,
respectively, and $N_z = f_{\cdot\cdot z}$ is the number of subjects in stratum
z. The variance formula given by Mantel is based on an argument
which conditions on the marginal frequency tables
$f_{y \cdot z}$ and $f_{\cdot xz}$. This is simply the permutation variance of U
under random permutations of the order of the elements of
$x_z$ and $y_z$. Thus,

$$\mathrm{Var}(U) = \sum_z \frac{N_z^2}{N_z - 1}\, S^2(x_z)\, S^2(y_z),$$

where $S^2(\cdot)$ denotes the (biased) sample variance of its vector
argument, $S^2(x) = \sum_i x_i^2/N - \bar{x}^2$. An "exact" test can be
generated by the same permutation argument.
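
As an illustration of the stratified score statistic and its permutation variance, here is a minimal sketch in plain Python. The function names and the toy data are hypothetical and are used only to show the arithmetic; this is not the author's software.

```python
from collections import defaultdict

def biased_var(v):
    """Biased sample variance: sum(v_i^2)/N - mean^2."""
    n = len(v)
    m = sum(v) / n
    return sum(a * a for a in v) / n - m * m

def mantel_score_test(y, x, z):
    """Return (U, Var(U)) for the stratified one-df score test.

    y: 0/1 disease indicators, x: numeric exposure codes,
    z: stratum labels, all of the same length.
    """
    strata = defaultdict(lambda: ([], []))
    for yi, xi, zi in zip(y, x, z):
        strata[zi][0].append(yi)
        strata[zi][1].append(xi)
    U = 0.0
    V = 0.0
    for ys, xs in strata.values():
        n = len(ys)
        if n < 2:
            continue  # a singleton stratum contributes nothing to U
        xbar = sum(xs) / n
        ybar = sum(ys) / n
        U += sum(a * b for a, b in zip(xs, ys)) - n * xbar * ybar
        V += n * n / (n - 1) * biased_var(xs) * biased_var(ys)
    return U, V

# Toy data: two strata of four subjects, genotype coded 0/1/2
y = [1, 0, 1, 0, 1, 1, 0, 0]
x = [2, 0, 1, 1, 2, 1, 0, 1]
z = [0, 0, 0, 0, 1, 1, 1, 1]
U, V = mantel_score_test(y, x, z)
chi2 = U * U / V  # referred to chi-squared on 1 df
print(U, V, chi2)
```

For these toy data each stratum contributes 1 to U and 2/3 to the variance, so the statistic is approximately 3.0 on one degree of freedom.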

In the context of the case-control study, the logistic regres- 
sion argument seems strange in that it conditions on X and 
Z and treats Y as a random outcome, while the study design 
suggests the reverse. However, Mantel's test conditions on 
both margins of the X × Y contingency table within each
stratum, thus generalizing the argument of Fisher's exact
test. That the asymptotic properties of the logistic regression
approach to the analysis of case-control studies remain
in force more generally, despite the reversal of conditioning,
was established by Prentice and Pyke (1979).

Exactly the same test can be generated from the stand- 
point of the log-linear model for contingency tables (Bishop 
et al., 1975; Goodman, 1970). The logistic regression model 
described above is implied by a log-linear model for the expected
values of the frequencies, $\{f_{yxz}\}$, which includes all
first-order associations:

$$\log E(f_{yxz}) = \psi_{yz} + \phi_{xz} + \beta x y.$$

The parameter $\beta$ in this model corresponds precisely with
the parameter $\beta$ of the logistic regression model above. The
score test of $H_0: \beta = 0$ is as before, and its null distribution
conditions on the sufficient statistics for $\psi$ and $\phi$, i.e., upon
the marginal tables $f_{y \cdot z}$ and $f_{\cdot xz}$. This corresponds to the
same permutation argument as described above. 

In genome-wide association studies, X represents a sin- 
gle nucleotide polymorphism (SNP) and has three levels. It 
is coded numerically, with the heterozygous genotype (Aa, 
say) coded at the midpoint between the values for the two 
homozygous genotypes (aa and AA). Thus, the alternative 
hypothesis holds that the odds ratio for Aa vs. aa is the same 
as that for AA vs. Aa, and is not equal to one. The test does not 
formally depend on this assumption, since it is evaluated 
under the null hypothesis $\beta = 0$, but this is the alternative
hypothesis for which its power is maximized. However, as 
we shall see, the score test described above is also the score 
test for a wider class of alternatives in which the risk for 
the heterozygous genotype is intermediate between those 
for the homozygous genotypes, so that the test will be lo- 
cally most powerful against this wider class. Perhaps re- 
flecting this, this score test is widely used in the analysis of 
genome-wide association studies. Nevertheless, a two degree
of freedom test against the wider class in which heterozygous
risk is unrestricted may easily be derived by coding
the SNP genotype as two indicator variables, $x_1$ and $x_2$
say. Such a test can be extended in exactly the same way as 
is proposed below for the one degree of freedom test. 

"A SURPRISING RESULT ..." 

A somewhat counter-intuitive aspect of this test was 
pointed out by Robinson and Jewell (1991), and has been 
discussed at some length in the epidemiological literature. 
Their "surprising result" is that, in a logistic regression anal- 
ysis in which X and Z are conditionally independent given 
Y but Y and Z are associated, inclusion of Z in the regression 
results in a loss of power as compared with an analysis
which simply ignores Z. Robinson and Jewell contrasted 
this with the case of a classical regression model where Y is 
a continuous, normally distributed response and X and Z 
are marginally independent, where the reduction in residual 
variance achieved by including Z in the model results in an 
increase in power. This latter situation fosters a strong intu- 
ition that one should always adjust for a covariate which is 
strongly related to response, even if it is not related to the 
predictor variable of interest. As is now well appreciated in 
the epidemiological literature, Robinson and Jewell's result 
shows that this is not the case for the analysis of case-control 
studies, where the sampling scheme leads to conditional in- 
dependence of X and Z given disease status, Y. Here, the use 
of the stratified test can result in appreciable loss of power. 

This has been noted by Ling Kuo and Feingold (2010) in 
the context of genetic association studies, where it is par- 
ticularly relevant. For example, in type 1 diabetes there is 
a long-established and very strong association with certain 
human leukocyte antigen (HLA) types yet, with the excep- 
tion of loci close to the major histocompatibility complex 
(MHC) region on chromosome 6, other genetic loci would 
not be expected to be related to HLA genotype at a popula- 
tion level. An example of an environmental covariate would 
be cigarette smoking in genetic association studies of lung 
cancer; again, we would expect most loci to be unrelated 
to smoking behaviour. While the loss of power resulting 
from the use of stratified tests can be avoided by matching in 
the design of the case-control studies, many genome-wide 
association studies make use of readily available, general 
purpose control groups. 

Some intuition concerning the loss of power due to strati- 
fication, and a clue to its possible recovery, are evident from 
consideration of the log-linear model. The log-linear model 
considered above, which after conditioning on X and
Z is equivalent to the logistic regression model, includes a
term for association between X and Z (represented by the
parameter set $\phi_{xz}$). If, under the null hypothesis, we may
assume independence of X and Z, then this term can be
omitted from the model. Then, in the expression for the
expected frequencies $e_{yxz}$, the ratios $f_{\cdot xz}/f_{\cdot\cdot z}$, which estimate
the distribution of X conditional on Z, can be replaced by the
marginal estimates $f_{\cdot x \cdot}/f_{\cdot\cdot\cdot}$. Then the test statistic simplifies to

$$U = \sum_x x \left( f_{1x\cdot} - \frac{f_{1\cdot\cdot}\, f_{\cdot x\cdot}}{f_{\cdot\cdot\cdot}} \right).$$

This is the usual Cochran-Armitage test statistic, and it ignores
the stratification, z. It can also be written $x^{T} y - N \bar{x} \bar{y}$,



where omission of the z subscript indicates disregarding 
of the stratification, and the appropriate argument for 
generating the variance of the test statistic (or an exact 
P-value) is permutation of the elements of x and y. In the
special case where X is dichotomous, the test becomes
Fisher's exact test for association in the 2 × 2 table.

Thus, taking account of the independence of X and Z 
in the model avoids the loss of power of the stratified test 
as compared with the test which disregards the stratifica- 
tion, but it leads to no gain in power — the two tests become 
identical. Thus, the counter-intuitive suggestion that one 
should ignore the stratifying variable(s) would seem to be 
supported. However, as will be shown below, this result is 
model-dependent and is unique to the choice of the "canon- 
ical" logit link function. 

GENERAL LINK FUNCTIONS 

A more general approach is to assume, in the population, 
a GLM, with binomial errors and arbitrary link function $g(\cdot)$:

$$g\big(\Pr(Y=1 \mid X=x, Z=z)\big) = \alpha + \beta x + \gamma_z.$$

For example, the "liability threshold" model widely used in
genetics can be represented in the above form by choosing
the link function $g(\mu) = \Phi^{-1}(\mu)$, the inverse of the Gaussian
cumulative distribution function. In the case-control study,
we still have a GLM, but with a modified link function,
$g_{cc}(\cdot)$, which now involves the ratio of sampling fractions,
K:

$$g_{cc}\big(\Pr(Y=1 \mid X=x, Z=z, S)\big) = g\!\left( \frac{\Pr(Y=1 \mid X=x, Z=z, S)}{K - (K-1)\Pr(Y=1 \mid X=x, Z=z, S)} \right) = \alpha + \beta x + \gamma_z.$$

The first derivative of the log-likelihood function from 
the case-control study now becomes 

$$\frac{\partial \log L}{\partial \beta} = \sum_{i=1}^{N} w(\mu_i)\, x_i\, (y_i - \mu_i),$$

where, as before, $\mu_i$ represents the fitted value of $y_i$ in the
case-control study, and the weight function, $w(\cdot)$, is given by

$$w(\mu) = \left[ \mu (1-\mu)\, g'_{cc}(\mu) \right]^{-1},$$

where $g'_{cc}(\cdot)$ represents the first derivative of $g_{cc}(\cdot)$. In general,
the weight function will depend on the ratio of sampling
fractions, K.

When the logit link function applies in the population, 
the weight function is constant. Otherwise the efficient score 
tests will be weighted versions of those derived in the pre- 
vious section. In particular, the statistic for the stratified test 
becomes 

$$U = \sum_z w_z \sum_x x\,(f_{1xz} - e_{1xz}),$$

where the stratum weights are

$$w_z = w\!\left( f_{1 \cdot z}/f_{\cdot\cdot z} \right),$$

a function of the ratio of cases to controls in each stratum.
When the stratifying variable is strongly related to disease,
these weights could vary substantially when a link other
than the logit is assumed. The null (permutation) variance
of this statistic is

$$\mathrm{Var}(U) = \sum_z w_z^2\, \frac{N_z^2}{N_z - 1}\, S^2(x_z)\, S^2(y_z).$$

As before, when it can be assumed that X and Z are in- 
dependent of one another, a more powerful test can be pro- 
posed. We simply replace the estimate of the conditional 
distribution of X given Z by the marginal sample distribu- 
tion of X, and obtain 

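$$U = \sum_z w_z \sum_x x \left( f_{1xz} - \frac{f_{1\cdot z}\, f_{\cdot x\cdot}}{f_{\cdot\cdot\cdot}} \right) = x^{T} W y - \bar{x}\, 1^{T} W y.$$

In the final expression, W represents a diagonal matrix with
diagonal elements given by the appropriate $w_z$. The appropriate
null distribution for this test is generated by random
permutations of the elements of x and Wy. Its mean is zero
and its variance is

$$\mathrm{Var}(U) = \frac{N^2}{N-1}\, S^2(x)\, S^2(Wy).$$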

In effect, this test amounts to using the existing unstrati- 
fied test, but with the 0/1 case-control indicator replaced 
by either zero (controls) or w z (cases). Thus, no new soft- 
ware will usually be required. A regression generalization 
follows naturally; instead of stratifying by a single discrete 
risk factor, we could regress phenotype on several known 
risk factors to obtain "fitted values," for each subject. The 
appropriate weight function $w(\mu)$ would then be used to
generate subject-specific weights for later association tests. 
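
The recipe of replacing the 0/1 case-control indicator by 0 (controls) or the stratum weight (cases) can be sketched as follows. This is an illustrative sketch only; the function names, the toy data and the weight values are hypothetical.

```python
def biased_var(v):
    """Biased sample variance: sum(v_i^2)/N - mean^2."""
    n = len(v)
    m = sum(v) / n
    return sum(a * a for a in v) / n - m * m

def weighted_unstratified_test(y, x, z, w):
    """Return (U, Var(U)) for the weighted, unstratified score test.

    y: 0/1 disease indicators, x: numeric exposure codes,
    z: stratum labels, w: dict mapping stratum label to its weight.
    Cases contribute w[z], controls contribute 0.
    """
    wy = [w[zi] * yi for yi, zi in zip(y, z)]
    n = len(x)
    xbar = sum(x) / n
    # U = x'Wy - xbar * 1'Wy
    U = sum(a * b for a, b in zip(x, wy)) - xbar * sum(wy)
    # Permutation variance of U over pairings of x and Wy
    V = n * n / (n - 1) * biased_var(x) * biased_var(wy)
    return U, V

y = [1, 0, 1, 0, 1, 1, 0, 0]
x = [2, 0, 1, 1, 2, 1, 0, 1]
z = [0, 0, 0, 0, 1, 1, 1, 1]

# Equal weights reproduce the ordinary Cochran-Armitage statistic
U0, V0 = weighted_unstratified_test(y, x, z, {0: 1.0, 1: 1.0})
# Hypothetical unequal weights, doubling the weight of stratum 1
U1, V1 = weighted_unstratified_test(y, x, z, {0: 1.0, 1: 2.0})
print(U0, V0, U1, V1)
```

With equal weights the function is exactly the unstratified test, which is why existing software can be reused once the response has been recoded.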

WEIGHTS FOR PROBIT AND OTHER LINK 
FUNCTIONS 

The weight function derived in the last section is ex- 
pressed as a function of the fitted value for the case-control 
indicator variable, $\mu$, and the first derivative of the link
function in the case-control study, $g'_{cc}(\mu)$. It can be convenient
to rewrite this in terms of the risk in the equivalent
stratum of the population, $\pi = \mu/[K - (K-1)\mu]$, and the
first derivative of the link function in the population model, $g'(\pi)$:

$$w(\mu) = \left[ \mu(1-\mu)\, g'_{cc}(\mu) \right]^{-1} = \left[ \pi(1-\pi)\, g'(\pi) \right]^{-1}.$$

We shall denote this function by $w_p(\pi)$.

The weight functions corresponding to some important 
link functions are shown in Table I. 

The probit link is of special interest. Its weight function 
for risks in the range (0.001, 0.5) is shown in Figure 1. 



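TABLE I. Stratum weights as a function of population risk

Link           $g(\pi)$                                     $w_p(\pi)$
Logit          $\log[\pi/(1-\pi)]$                          $1$
Log            $\log \pi$                                   $(1-\pi)^{-1}$
Probit         $\Phi^{-1}(\pi)$                             $\exp\{-[\Phi^{-1}(\pi)]^2/2\} / [\pi(1-\pi)]$
Identity       $\pi$                                        $[\pi(1-\pi)]^{-1}$
Independence   $-\log(1-\pi)$                               $\pi^{-1}$
Power odds     $\{[\pi/(1-\pi)]^{\lambda} - 1\}/\lambda$    $[\pi/(1-\pi)]^{-\lambda}$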







Fig. 1. The probit link weight function for population risks in the
range (0.001, 0.5), scaled relative to the weight for a risk of 0.01.
Horizontal axis: risk (logit scale).

The probit weight function is approximately linear in $\mathrm{logit}(\pi)$, and strata in which
the population risk is low are given somewhat greater
weight. Another link of special interest is the "independence"
link function which, as we shall see, serves as a model
for the absence of epistasis. In that case,
the weight function is simply $w_p(\pi) = \pi^{-1}$ so that, again,
greater weight is given to low-risk strata (although to a
much greater extent than for the probit link). For small population
risks (the position we are usually faced with), the
independence link is closely approximated by the identity
link and the log link is closely approximated by the logit
link.

The "power odds" link function was discussed in the 
context of case-control studies by Breslow and Storer (1985). 
This general family of link functions leads to a wide range 
of weighting schemes: 

(1) $\lambda = 0$: (logistic model) the score test is unweighted;

(2) $\lambda > 0$: (risk accumulates less fast than the logistic model)
low-risk strata are up-weighted in the test; and

(3) $\lambda < 0$: (risk accumulates faster than the logistic model)
high-risk strata are up-weighted in the test.

This family of link functions is particularly convenient 
here in that $\pi/(1-\pi)$ can be replaced by $\mu/(1-\mu)$, thus
avoiding the need to estimate the sampling ratio. The
stratum weights are simply a power of the stratum-specific
case:control ratios. The probit link function is not far from
linearly related to the logit link, but is most closely linear with
the power odds link with $\lambda = 0.12$.
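
A minimal sketch of stratum weights for the power odds family follows. It assumes the family has the Box-Cox form on the odds scale, so that the weight is the stratum case:control ratio raised to the power of minus $\lambda$; that functional form, the function name and the counts are assumptions of this sketch rather than details confirmed by the text.

```python
def power_odds_weights(cases, controls, lam):
    """Stratum weights (cases/controls)^(-lam) for each stratum.

    cases, controls: dicts mapping stratum label to counts.
    The sampling ratio K would only multiply every weight by the
    same constant, so it is not needed.
    """
    return {z: (cases[z] / controls[z]) ** (-lam) for z in cases}

# Hypothetical counts in a low-risk and a high-risk stratum
cases = {"low": 50, "high": 200}
controls = {"low": 100, "high": 100}

w_logistic = power_odds_weights(cases, controls, 0.0)   # unweighted test
w_probit_like = power_odds_weights(cases, controls, 0.12)
w_additive = power_odds_weights(cases, controls, 1.0)
print(w_logistic, w_probit_like, w_additive)
```

Under this assumed form, a positive exponent gives the low-risk stratum the larger weight, in line with the behaviour described for the probit and independence links.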

POWER 

This section explores the extent to which the weighting 
scheme is likely to be important in real studies. Two scenar- 
ios are considered: 

(1) (Large risk differentials). The population is assumed to 
fall into seven equally populated risk strata with risks 
0.1, 0.05, 0.02, 0.01, 0.005, 0.002, and 0.001— a 100-fold 
variation. 

(2) (Moderate risk differentials). The population is as- 
sumed to fall into four equally populated risk strata 
with risks 0.05, 0.02, 0.01, and 0.005— a 10-fold varia- 
tion. 

In both cases, the non-centrality parameter of the Chi-squared
test on 1 degree of freedom, $U^2/\mathrm{Var}(U)$, was calculated
for a very small effect of a diallelic locus, with minor
allele frequency 0.5, assumed constant across strata.

Figure 2 shows the effect of misspecifying the link function
in the context of the power odds family. In the left-hand
panel, datasets are generated using different values of $\lambda$ and
analysed with $\lambda = 0$ (which, as shown above, is equivalent
to the standard Cochran-Armitage test). The efficiency measure
plotted is the ratio of the non-centrality parameter for
the Chi-squared (1 df) test to its value when the link is
correctly specified. The lower curve refers to the more extreme
scenario 1, and the upper to scenario 2. When the
data are generated with $\lambda < 0$, corresponding to accumulation
of risk being faster than multiplicative, the standard
Cochran-Armitage test can be quite inefficient. For positive
values of $\lambda$, corresponding to submultiplicative risk accumulation
(with $\lambda = 1$ approximating additivity of risks), the
loss of power is less extreme, although it can still be substantial.
The right-hand panel of Figure 2 shows the same
index when the data were generated according to the logistic
regression model, but different values of $\lambda$ are used to
calculate weights.

The difference between logit and probit links is quite 
modest. The relative efficiency when using a Cochran- 
Armitage test when the data were generated by the probit 
(underlying liability) model is 0.949 in scenario 1 and 0.969 
in scenario 2. The corresponding values when the data were 
generated by the logistic model and analysed with probit 
weights are 0.983 and 0.984. 

As indicated earlier, a possible application of these meth- 
ods is the search for new disease susceptibility loci for type 
1 diabetes. Here there is a very strong HLA association and 
some suggestion in the literature that additional effects tend 
to contribute in a submultiplicative manner. Figure 3 shows 
the results of fitting the power odds model in a large case- 
control study (Clayton, 2009). 

For each value of $\lambda$, the regression model including all
known disease susceptibility loci was fitted using standard
generalized linear modelling software, and the figure
shows the profile log-likelihood function for $\lambda$. The
maximum-likelihood estimate was 0.05, corresponding to
a (slightly) submultiplicative risk accumulation part-way
between logit and probit links, although both of these links
would be rejected at the 0.05 level by a likelihood ratio
Chi-squared test. The model of additive risk contributions
($\lambda = 1$) is clearly unsupported.

Fig. 2. Efficiency loss due to misspecification of the link function (power odds family).

Fig. 3. Log-likelihood for the parameter, $\lambda$, of the power odds link
for a case-control study of type 1 diabetes. Horizontal axis: power, $\lambda$.

INTERACTION AND EPISTASIS 

The concept of link functions is also highly relevant to 
the continuing debate concerning the role of "epistasis" in 
complex disease genetics. In particular, it is widely believed 
that both failure to find new disease associations and poor 
prediction from known loci are attributable to a failure to 
allow for epistasis. The results of previous sections have 
considerable relevance to this suggestion. 

It has frequently been pointed out that the genetics literature
is confused on the topic of epistasis (see, e.g., Phillips
(1998) and Cordell (2002)), but misconceptions due to a confused
use of language still remain. The term was coined by
Bateson, in the context of fully penetrant traits whose inher- 
itance could not be explained by action of a single locus. But, 
for traits which are not fully penetrant, the phenomenon be- 
comes much more difficult to define precisely. In consider- 
ing quantitative traits (which, by their nature, are not fully 
penetrant), Fisher (1918) introduced the term "epistacy" to 
refer to statistical interaction, or non-additivity of the ef- 
fects of two or more loci, in the same way as he used the 
term "dominance" to refer to non-additivity of the effects 
of the two alleles at a single autosomal locus. However, al- 
though there are clear analogies between the quantitative 
measures introduced by Fisher and the related concepts for 
fully penetrant traits, they are not the same thing and the 
resultant blurring of the language has not been helpful. In- 
deed, Fisher's epistacy has widely been referred to as epista- 
sis, adding to the confusion. The problem of defining these 
concepts in terms of statistical interaction is that statistical 
interaction is scale-dependent; if two factors act additively 
on one scale of measurement they will interact, in the statis- 
tical sense, on a different scale. The same problem has been 
faced by epidemiologists in considering the question of the 
independence of causes in multifactorial disease and the is- 
sue has been widely debated in that literature (Thompson, 
1991). 

For incompletely penetrant dichotomous traits, Risch 
(1990) considered the role of epistasis both in determin- 
ing the pattern of recurrence risks in extended families, and 
for the power of linkage studies. In these detailed stud- 
ies, he defined absence of epistasis in terms of locus het- 
erogeneity. This corresponds with the epidemiological con- 
cept of independent sufficient causes; if one factor acting 
alone causes a disease with probability $\gamma_1$ and a second
factor, also acting alone, causes the disease with probability
$\gamma_2$, then to remain free of disease one must avoid
both causes and, if the causes are statistically independent,
$(1 - \pi) = (1 - \gamma_1)(1 - \gamma_2)$. Writing $\beta_i = -\log(1 - \gamma_i)$,
we then have

$$-\log(1 - \pi) = \beta_1 + \beta_2.$$

This is an additive model with link function $-\log(1 - \pi)$
(referred to above as the "independence" link function).
For small $\pi$, the usual position for disease traits, the independence
link is closely approximated by the identity link.
Thus, in the model of Risch, absence of epistasis corresponds
closely to additivity of effects on risk itself.



In contrast, Risch chose to model epistatic action of two 
factors in terms of the multiplicative model for risks. A 
heuristic for this model is to consider the case in which the 
two causes considered in the previous paragraph must both 
occur for the disease to be penetrant. Then, again assuming 
independence of causes, $\pi = \gamma_1 \gamma_2$ and, writing $\beta_i$ instead of
$\log \gamma_i$,

$$\log \pi = \beta_1 + \beta_2,$$

a model for additivity of effects on the log scale. Risch 
showed that relative recurrence risks fell much more 
sharply with distance of relationship to proband under 
this epistatic model than under the additive, non-epistatic 
model. 
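
The two heuristics can be verified with a few lines of arithmetic. The sketch below is illustrative only; the cause-specific probabilities are hypothetical values chosen to be small, as is usual for disease traits.

```python
import math

def independence_link(pi):
    """The link -log(1 - pi)."""
    return -math.log(1.0 - pi)

g1, g2 = 0.002, 0.005  # hypothetical probabilities for the two causes

# Independent sufficient causes: disease if either cause acts
pi_either = 1.0 - (1.0 - g1) * (1.0 - g2)
additive_lhs = independence_link(pi_either)
additive_rhs = independence_link(g1) + independence_link(g2)

# Both causes required: disease only if both act
pi_both = g1 * g2
log_lhs = math.log(pi_both)
log_rhs = math.log(g1) + math.log(g2)

print(additive_lhs, additive_rhs)   # equal: additive on -log(1 - pi)
print(log_lhs, log_rhs)             # equal: additive on log(pi)
print(pi_either, g1 + g2)           # nearly equal when risks are small
```

The last line shows why, for small risks, the absence of epistasis in this sense is close to additivity on the risk scale itself.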

When the attention of scientists studying the genetics of 
complex disease switched to association studies, in particu- 
lar to case-control studies, the obvious statistical approach 
was regression analysis and, for reasons given earlier, the 
most natural choice of regression model is logistic regres- 
sion. Many authors, perhaps following Fisher, then started 
to use the term epistasis to refer to statistical interaction in 
the logistic regression model. Since, for small $\pi$, the logit
link is nearly the same as the log link, from this standpoint 
absence of epistasis corresponds to the model of multiplica- 
tive effects — precisely the model used by Risch as a model 
for epistatic action. 

When, in some fields, the strategy of testing loci for as- 
sociation one-at-a-time led to disappointing results, the ar- 
gument was frequently advanced that, since complex dis- 
eases clearly must involve interactions of causes in a mech- 
anistic sense, association tests should allow for epistasis. 
Here, however, the results of previous sections are instruc- 
tive; these show that the strategy of disregarding other loci 
when testing a new locus of interest is optimal when there 
is no interaction on the logit scale, and this corresponds 
with Risch's model for epistasis. Ironically, with the dis- 
ease model of independent sufficient causes, Risch's model 
for absence of epistasis, the strategy of ignoring other loci 
is no longer optimal; subjects carrying high-risk alleles at 
other loci should be given less weight in the analysis. How- 
ever, Figure 2 shows that very substantial loss of power 
due to the one-at-a-time strategy only occurs when effects 
are supra-multiplicative ($\lambda < 0$), and it is not clear whether
such interaction is widespread for complex traits. 

In the context of the model of an underlying normally 
distributed liability, it is natural to regard interaction on the 
probit scale as a natural generalization of Fisher's "epis- 
tacy." But there would seem to be no reason to believe 
that such interaction represents epistasis in any mechanistic 
sense. In the simple conceptualization of the model of no 
interaction on the probit scale, different causes contribute 
independently and additively to some underlying latent 
construct (liability), and disease follows when this exceeds 
a threshold. But this does not correspond to a model of inde- 
pendent sufficient causes; indeed, since the causes are acting 
in concert within a single mechanism, this model could be 
thought of as a model of epistasis rather than a model for 
no epistasis. 

PREDICTION 

The question of the ability to predict, from their geno- 
types, those individuals who will develop various complex 
diseases has important implications for the future of health
care, and has received some recent attention. Much depends 
on the manner in which risk accumulates due to the accu- 
mulation of multiple risk alleles. The link function provides 
a simplified way of addressing this problem, hopefully to 
guide intuition. The question of interest is, for a disease 
with a given heritability (loosely defined), just how pre- 
dictable would disease occurrence be if we typed every dis- 
ease susceptibility locus responsible? Here, the measure of 
heritability used will be the sibling relative recurrence risk, 
$\lambda_s$, defined as the ratio of the risk to the sibling of an affected
proband to the general population risk, and predictability 
will be the receiver operating characteristic (ROC) curve — 
the plot of the true positive probability (sensitivity) vs. the 
false positive probability for different thresholds for a geno- 
type score. 

THE MULTIPLICATIVE (LOG LINK) MODEL 

A multiplicative model for risks, appropriate for rare dis- 
eases, assumes that the multiple genes contribute to a ge- 
netic risk score, assumed to be approximately normally dis- 
tributed in the population, risk being related to this genetic 
risk score via a log-link function. This model has been ex- 
plored by Pharoah et al. (2002) and by Clayton (2009). 

If the distribution of a genetic score, x, in the population
is N(0,1), and the risk conditional on x is given by

$$\log \pi = \alpha + \beta x,$$

then the marginal population risk of disease is $P = \exp(\alpha + \beta^2/2)$
and the sibling recurrence risk ratio is $\lambda_s = \exp(\beta^2/2)$.
The distribution of the genetic risk score conditional upon
disease occurrence is $N(\beta, 1)$, thus allowing plotting of ROC
curves (Figure 4).

This model has been criticized by Wray et al. (2010),
largely on the grounds that it can lead to predicted risks
which exceed 1.0. However, this is unlikely to be a serious
problem for most complex disease traits. For example,
for P = 0.005 and $\lambda_s = 5$, only a population fraction of
$5.9 \times 10^{-5}$ would have predicted risks greater than 1.0.

Fig. 4. ROC curves for the log-link model for $\lambda_s$ of 1.5, 2, 3, 5, and
10. The population risk, P, is taken as 0.005.
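
The quantities in the log-link model can be computed directly from P and the sibling recurrence risk ratio. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the false positive rate is obtained by subtracting the cases from the whole-population score distribution.

```python
import math
from statistics import NormalDist

N01 = NormalDist()  # distribution of the genetic score in the population

def log_link_params(P, lam_s):
    """Solve P = exp(alpha + beta^2/2), lam_s = exp(beta^2/2)."""
    beta = math.sqrt(2.0 * math.log(lam_s))
    alpha = math.log(P) - beta * beta / 2.0
    return alpha, beta

def roc_point(threshold, beta, P):
    """(false positive rate, sensitivity) for 'score > threshold'."""
    sens = 1.0 - N01.cdf(threshold - beta)     # cases: score ~ N(beta, 1)
    pop_pos = 1.0 - N01.cdf(threshold)         # whole population
    fpr = (pop_pos - P * sens) / (1.0 - P)     # non-cases only
    return fpr, sens

def fraction_risk_above_one(P, lam_s):
    """Population fraction whose predicted risk exp(alpha + beta x) > 1."""
    alpha, beta = log_link_params(P, lam_s)
    return 1.0 - N01.cdf(-alpha / beta)

alpha, beta = log_link_params(0.005, 5.0)
frac = fraction_risk_above_one(0.005, 5.0)
fpr, sens = roc_point(2.0, beta, 0.005)
print(alpha, beta, frac, fpr, sens)
```

For P = 0.005 and a sibling recurrence risk ratio of 5, the computed fraction is close to the figure of about 5.9 in 100,000 quoted in the text.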

THE INDEPENDENT SUFFICIENT CAUSES 
MODEL 

A tractable independent sufficient causes model as- 
sumes 

(1) the number of risk variants, x, inherited by an individual
follows a Poisson distribution with mean $\mu$;

(2) each variant, acting alone, would have penetrance p; 
and 

(3) there exists a background non-heritable (sporadic) risk, 
b, acting as a further independent sufficient cause. 

Under the independent sufficient causes model, the causes
act additively with link function $-\log(1 - \pi)$:

$$-\log(1 - \pi) = -\log(1 - b) - x \log(1 - p)$$

and, summing over the Poisson distribution of x, the
marginal population risk becomes

$$P = 1 - (1 - b)\, e^{-\mu p}.$$

The numbers of risk alleles inherited by two siblings, $x_1$ and
$x_2$, may be decomposed into a shared set and two residual
unshared components, these three counts following Poisson
distributions with means $\mu/2$. Under this model, the sibling
recurrence risk ratio becomes

$$\lambda_s = 1 + \left( \frac{1}{P} - 1 \right)^{2} \left( e^{\mu p^2/2} - 1 \right).$$

The ROC curves for this model may be simply described; for
even moderately high values of $\lambda_s$ (> 2, say), the curve rises
to $(1 - b)$ in a single step, thereafter following a straight
line to the (1, 1) corner. In effect, the non-sporadic cases are
predicted perfectly.

This behaviour reflects the fact that the model, in effect,
reverts to the single gene model with locus heterogeneity.
To explain relatively high values of $\lambda_s$, the penetrance of
each causal genetic variant must be high and, to explain the
relatively low population prevalence of disease, these variants
must be rare. As a result, non-sporadic cases will have
at least one variant (and rarely more than one), while
non-cases will be very unlikely to carry any such variants.
Prediction from genotype will be extremely accurate. However,
when the value of $\lambda_s$ is little above one, this is no longer the
case and prediction is little better than when a multiplicative
model holds. For example, with $\mu = 5$, $P = 0.005$ and
no sporadic cases ($b = 0$), this model yields $\lambda_s = 1.1$. The
mean number of causal variants carried by cases is approximately 6
and, by non-cases, approximately 5. The ROC curve is little
different from that of the log-link model for the same value
of $\lambda_s$ (Figure 5).
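The worked example for the independent sufficient causes model can be reproduced with a short calculation. This is a sketch under the stated model assumptions (Poisson variant count with mean $\mu$, penetrance $p$, sporadic risk $b$), using the marginal risk $P = 1 - (1 - b)e^{-\mu p}$ and $\lambda_s = 1 + (1/P - 1)^2 (e^{\mu p^2/2} - 1)$. The function names are illustrative, and the conditional means are derived here from the Poisson assumption rather than quoted from the text.

```python
from math import expm1, log


def penetrance_from_risk(pop_risk, mu, b=0.0):
    """Invert P = 1 - (1 - b) * exp(-mu * p) to obtain the penetrance p."""
    return -log((1.0 - pop_risk) / (1.0 - b)) / mu


def sibling_recurrence_ratio(pop_risk, mu, p):
    """lambda_s = 1 + (1/P - 1)^2 * (exp(mu * p^2 / 2) - 1)."""
    return 1.0 + (1.0 / pop_risk - 1.0) ** 2 * expm1(mu * p ** 2 / 2.0)


def mean_variant_counts(pop_risk, mu, p):
    """Mean number of variants carried by cases and by non-cases.

    For Poisson x, E[x | non-case] = mu * (1 - p); the mean in cases then
    follows from E[x] = P * E[x | case] + (1 - P) * E[x | non-case].
    """
    mean_noncases = mu * (1.0 - p)
    mean_cases = (mu - (1.0 - pop_risk) * mean_noncases) / pop_risk
    return mean_cases, mean_noncases


mu, pop_risk = 5.0, 0.005
p = penetrance_from_risk(pop_risk, mu, b=0.0)
lambda_s = sibling_recurrence_ratio(pop_risk, mu, p)
mean_cases, mean_noncases = mean_variant_counts(pop_risk, mu, p)
print(f"penetrance p = {p:.6f}")
print(f"lambda_s = {lambda_s:.3f}")
print(f"mean variants: cases {mean_cases:.2f}, non-cases {mean_noncases:.2f}")
```

For $\mu = 5$, $P = 0.005$ and $b = 0$ the results should match the values quoted in the text: $\lambda_s$ close to 1.1, with cases carrying about 6 variants and non-cases about 5.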

THE POWER LINK 

Further light is cast on the impact of the link function on
the ability to predict a disease trait from genotype by
considering the power transformation. For simplicity, consider
a rare disease so that the power odds link can be replaced
by the power transformation of risk itself, so that

$$\frac{1}{\lambda}\left(\pi^{\lambda} - 1\right) = \alpha + \beta x.$$

Fig. 5. ROC curves for the independent sufficient causes link
(solid line) and the log link (dashed line) for $\lambda_s = 1.1$, $P = 0.005$.

Fig. 6. ROC curves for power link. Solid line has $\lambda = 0$
(multiplicative model), dashed line has $\lambda = -0.05$, and dotted line has
$\lambda = +0.1$ ($\lambda_s = 5$, $P = 0.005$).



For small $\lambda$, this can be approximated by a quadratic model
with log link:

$$\log \pi \approx \alpha + \beta x - \frac{\lambda}{2}(\alpha + \beta x)^2.$$
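The quadratic form follows from a Taylor expansion of the power link in $\lambda$. A brief worked derivation is sketched here, writing $\eta = \alpha + \beta x$ for the linear predictor ($\eta$ is shorthand introduced for this sketch, not notation from the text):

```latex
\begin{align*}
\pi^{\lambda} &= 1 + \lambda\eta, \\
\log \pi &= \frac{1}{\lambda}\log(1 + \lambda\eta) \\
 &= \frac{1}{\lambda}\left\{\lambda\eta - \frac{(\lambda\eta)^2}{2}
    + O(\lambda^3\eta^3)\right\} \\
 &\approx \eta - \frac{\lambda}{2}\eta^2
  = \alpha + \beta x - \frac{\lambda}{2}(\alpha + \beta x)^2 .
\end{align*}
```

Letting $\lambda \to 0$ removes the quadratic term and recovers the log link. The expansion of the logarithm requires $|\lambda\eta| < 1$.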

When the population distribution of x is normal and risks
are small, this yields tractable expressions for relative
recurrence risks and for ROC curve calculations. Figure 6 shows
ROC curves for the multiplicative model ($\lambda = 0$) and for
small positive and negative deviations from it.

For the submultiplicative model ($\lambda = +0.1$), prediction is
improved with respect to the multiplicative model while,
for the supra-multiplicative model ($\lambda = -0.05$), prediction
is worse.

THE PROBIT MODEL 

The probit model also leads to tractable estimates for the
ROC curve given availability of an algorithm to estimate the
bivariate normal integral (Wray et al., 2010). If genotype and
environment scores, g and e say, are distributed as N(0,1)
and heritability of liability is H, then liability, defined as
$g\sqrt{H} + e\sqrt{1 - H}$, is also N(0,1). For marginal population
risk P, the liability threshold is $q = \Phi^{-1}(1 - P)$ and, since
the correlation between liabilities of two siblings is H/2, the
sibling recurrence risk ratio is $\lambda_s = T(q, q; H/2)/P^2$, where

$$T(a, b; r) = \int_a^\infty \int_b^\infty \phi(u, v; r)\, dv\, du,$$

$\phi$ denoting the bivariate normal density function with
unit variances and correlation r. Since genotype score, g, has
correlation $\sqrt{H}$ with liability, the cumulative distribution
function of genotype score in cases is

$$F(s) = 1 - T(s, q; \sqrt{H})/P.$$
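The probit calculations can be sketched with the standard library alone. The code below is an illustrative implementation, not the algorithm used by the authors or by Wray et al. (2010): it evaluates the bivariate normal upper-tail probability $T(a, b; r)$ by Simpson's rule applied to the one-dimensional form $\int_a^\infty \phi(u)\,P(V > b \mid U = u)\,du$, then derives $\lambda_s$ and a point on the ROC curve. The function names, the integration cut-off and the number of intervals are assumptions made for this sketch.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def upper_orthant(a, b, r, n=2000):
    """T(a, b; r) = P(U > a, V > b) for a standard bivariate normal pair
    with correlation |r| < 1, by Simpson's rule (n must be even)."""
    upper = max(a, 0.0) + 8.0  # assumed cut-off; the integrand is negligible beyond it
    h = (upper - a) / n
    s = sqrt(1.0 - r * r)

    def integrand(u):
        # phi(u) * P(V > b | U = u), where V | U = u is N(r * u, 1 - r^2)
        return STD_NORMAL.pdf(u) * (1.0 - STD_NORMAL.cdf((b - r * u) / s))

    total = integrand(a) + integrand(upper)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * integrand(a + i * h)
    return total * h / 3.0


def probit_lambda_s(pop_risk, heritability):
    """Sibling recurrence risk ratio T(q, q; H/2) / P^2."""
    q = STD_NORMAL.inv_cdf(1.0 - pop_risk)
    return upper_orthant(q, q, heritability / 2.0) / pop_risk ** 2


def probit_roc_point(score_threshold, pop_risk, heritability):
    """(false positive rate, sensitivity) when calling g > score_threshold positive."""
    q = STD_NORMAL.inv_cdf(1.0 - pop_risk)
    joint = upper_orthant(score_threshold, q, sqrt(heritability))
    sensitivity = joint / pop_risk  # equals 1 - F(s)
    false_positive = (1.0 - STD_NORMAL.cdf(score_threshold) - joint) / (1.0 - pop_risk)
    return false_positive, sensitivity


print(f"lambda_s for P = 0.01, H = 0.5: {probit_lambda_s(0.01, 0.5):.2f}")
fpr, sens = probit_roc_point(1.0, 0.01, 0.5)
print(f"ROC point at threshold 1.0: FPR = {fpr:.3f}, sensitivity = {sens:.3f}")
```

Sweeping `score_threshold` over a grid traces out the full ROC curve for given P and H. To match a target $\lambda_s$, H can be found by a one-dimensional root search on `probit_lambda_s`.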

Figure 7 compares some ROC curves for different values
of $\lambda_s$ and P. For $\lambda_s = 5$, as was pointed out by Wray
et al. (2010), prediction improves markedly with increasing
population prevalence P and, even when P is very low, is
significantly better than in the case where the multiplicative
model holds. The approximate curve for the power link
approximating to the probit ($\lambda = +0.12$) provides a fair
approximation. When $\lambda_s = 2$, the five curves are much closer
to one another.

The explanation for these results follows from the form 
of the probit link. We would expect to see a similar pat- 
tern for the logit link, another sigmoid function (although 
this case does not lead to tractable calculations). For small 
risks, these models are close to the multiplicative model for 
risks while, for larger risks the probit and logit links become 
nearly linear, approximating the additive model for risk. For 
risks close to one, the models are close to the independent 
sufficient causes model (i.e., multiplicative in $-\log(1 - \pi)$).
As we have seen, prediction is much better when an ad- 
ditive or independent sufficient causes model holds. Thus, 
prediction can be better under the probit model (and, one 
would presume, under a logistic model) because, for larger 
risks, it becomes closer to the additive risks model. 

DISCUSSION 

We have shown that the strategy of ignoring other known 
disease susceptibility loci and risk factors when testing 
for new associations with complex disease, for example 
in genome-wide association studies, is justifiable, but only 
when effects combine additively on the logistic scale. More 
generally, weighted analyses may be appropriate, but this 
raises the question of how the weighting scheme might be 
chosen. This problem is ubiquitous in the choice of statistical
tests: the optimal choice of test depends on the unknown
state of nature. We have past experience to guide
us so that, as in the case of type 1 diabetes (Figure 3), it
is possible to estimate a link function by choosing from a
general family such as the power-odds family using current
data. Even here, however, the fact that one link function is
generally most appropriate does not guarantee that it will
always be optimal. Although estimation of the link function
from available data would probably be advocated as
the best strategy, in the analysis of a genome-wide study
there is a case for repeating the analysis with several plausible
link functions. Although this could marginally increase
the false positive rate, such studies will, in any case, require
replication.

Fig. 7. ROC curves for probit model for $\lambda_s = 5$ (left) and $\lambda_s = 2$ (right), and for P = 0.01, 0.001, and 0.0001 (solid lines, top to bottom).
The dotted line is for the multiplicative model with P = 0.0001, and the dashed line is for the corresponding approximate power link
model with power, $\lambda = +0.12$.

It is commonly believed that underlying mechanisms for 
such diseases must involve "epistatic" action and, there- 
fore, that statistical interactions must be widespread. How- 
ever, the concept of "epistasis" cannot be simply identified 
with statistical interaction. Indeed, the logistic regression 
model with no statistical interaction between genes is quite 
strongly epistatic. The need to allow for other factors in 
carrying out association tests is particularly pressing in the 
presence of epistasis which is "supra-multiplicative," in the 
sense that the joint effect of multiple factors exceeds the 
product of their effects when acting alone. 

The precise scale on which multiple factors combine can 
also be important for assessing the potential for prediction 
of a trait from genotype, given its heritability (as assessed by 
recurrence risk ratios). Although the link function does not 
seem to be very important for a trait which is not very heri- 
table, it can have quite an influence for a strongly heritable 
trait. If factors act supra-multiplicatively, phenotype is less
predictable from genotype while, when joint effects are less
than the product of single effects, phenotype is potentially
more predictable from genotype. These considerations go
some way to explaining why the instincts of geneticists,
who would expect a phenotype with high monozygotic twin
concordance rate to be highly predictable, differ from those of
epidemiologists familiar with the "prevention paradox" (Rose,
1992), who are more sceptical: in many multifactorial diseases,
most cases derive from the bulk of the population at
only modestly increased risk. It is too early to say which
viewpoint will be appropriate for most common complex
traits, although an epistatic model might be judged rather
more plausible in most cases and this would favour the
more sceptical view.
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